GH-563: Make ColumnMetaData.path_in_schema optional #564

Open
etseidl wants to merge 3 commits into apache:master from etseidl:deprecate_path_in_schema

Contributor

@etseidl etseidl commented Apr 8, 2026

Rationale for this change

What changes are included in this PR?

Change path_in_schema to optional.

Do these changes have PoC implementations?

Yes.

Closes #563

Contributor Author

etseidl commented Apr 8, 2026

I hope to have a Java PoC available soon.

Contributor Author

etseidl commented Apr 8, 2026

Java PoC apache/parquet-java#3470

I've so far confirmed that parquet-cli cat from the Java PoC can read a file lacking path_in_schema generated by arrow-rs.

etseidl marked this pull request as ready for review April 9, 2026 17:53

Contributor

alamb commented Apr 10, 2026

I think it is a great idea -- though before merging this I think we should do a formal approval on the mailing list

Contributor

@alamb alamb left a comment

Thank you for driving this along @etseidl

Comment thread src/main/thrift/parquet.thrift Outdated

Contributor Author

etseidl commented Apr 10, 2026

> I think it is a great idea -- though before merging this I think we should do a formal approval on the mailing list

For sure! 👍 I just wanted to put up a concrete proposal to drive the discussion.

Also, FWIW, I've started on an arrow-cpp PoC. We'll see how far I get 😅

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Contributor Author

etseidl commented Apr 10, 2026

C++ PoC apache/arrow#49707

Comment thread src/main/thrift/parquet.thrift Outdated
* the schema, and redundantly storing it here can lead to unnecessary
* bloat in the footer. Writers are encouraged to make the writing of
* this field optional, but for maximal compatibility should default to
* writing the field until at least Month 202X.

Contributor

Based on "Forward incompatible features/changes should not be turned on by default until 2 years after the parquet-java implementation containing the feature is released." Let's maybe fill in the date as September 2028, assuming we get things merged by a September Java release?

Contributor Author

I've gone ahead and put Sept 2028 in the text for now. We can update as needed later.
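
For reference, the date arithmetic behind "September 2028" can be sketched as follows. This is only an illustration: it assumes the parquet-java release containing the change ships in September 2026, which is implied by the suggestion above but is not a confirmed release date, and the helper name `earliest_default_on` is hypothetical.

```python
from datetime import date


def earliest_default_on(release: date, years: int = 2) -> date:
    """Earliest date a forward-incompatible feature could become the default.

    Applies the "2 years after the parquet-java release" rule quoted above.
    """
    try:
        return release.replace(year=release.year + years)
    except ValueError:
        # Only reachable for a Feb 29 release when the target year is not a leap year.
        return release.replace(year=release.year + years, day=28)


# Assumed release month (September 2026); the day of month is arbitrary.
assumed_release = date(2026, 9, 1)
print(earliest_default_on(assumed_release).strftime("%B %Y"))
```

If the Java release slips, only `assumed_release` needs to change.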

alamb pushed a commit to apache/arrow-rs that referenced this pull request May 7, 2026
# Which issue does this PR close?

none

# Rationale for this change
This is a proof of concept implementation for
apache/parquet-format#563

# What changes are included in this PR?

Since version 57.0.0, this crate has been tolerant of a missing
`path_in_schema`. This PR adds options to cease writing the field as
well. The option defaults to continuing to write the field.

See related discussion on parquet mailing list:
https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt

# Are these changes tested?

Yes

# Are there any user-facing changes?

No, this only adds an optional behavior change that defaults to no
change

# Related PRs
- apache/parquet-format#563
- apache/parquet-format#564
- apache/parquet-java#3470

Member

wgtmac commented May 12, 2026

@etseidl How do we want to move forward?

Contributor

alamb commented May 12, 2026

FWIW we merged an option to not write path_in_schema (off by default) in the Rust implementation, which people can choose to use

Member

wgtmac commented May 12, 2026

> FWIW we merged an option to not write path_in_schema (off by default) in the Rust implementation, which people can choose to use

Don't we need to change the spec first?

Contributor

alamb commented May 12, 2026

> > FWIW we merged an option to not write path_in_schema (off by default) in the Rust implementation, which people can choose to use
>
> Don't we need to change the spec first?

Not in my opinion -- see my rationale here:

Contributor Author

etseidl commented May 12, 2026

> @etseidl How do we want to move forward?

@emkornfield mentioned on the mailing list that we should give a week for others to comment, but that was back in April (https://lists.apache.org/thread/900503q07v95vyh6fk3qfn7ynb4w6yn2). I think I need to loop back and make a file for parquet-testing, then test it with the 3 PoCs. Then I think we can bring this up for a vote.

Contributor Author

etseidl commented May 12, 2026

Submitted apache/parquet-testing#108

Verified the file is read by the arrow-cpp PoC

% python
>>> import pyarrow
>>> pyarrow.__version__
'24.0.0.dev298+g24f0f4c9a'
>>> from pyarrow import parquet as pq
>>> df = pq.read_table('src/parquet-testing/data/no_path_in_schema.zstd.parquet')
>>> df
pyarrow.Table
a: map<string, map<int32, bool ('value')> ('a')>
  child 0, a: struct<key: string not null, value: map<int32, bool ('value')>> not null
      child 0, key: string not null
      child 1, value: map<int32, bool ('value')>
          child 0, value: struct<key: int32 not null, value: bool not null> not null
              child 0, key: int32 not null
              child 1, value: bool not null
b: int32 not null
c: double not null
----
a: [[keys:["a"]values:[keys:[1,2]values:[true,false]],keys:["b"]values:[keys:[1]values:[true]],keys:["c"]values:[null],keys:["d"]values:[keys:[]values:[]],keys:["e"]values:[keys:[1]values:[true]],keys:["f"]values:[keys:[3,4,5]values:[true,false,true]]]]
b: [[1,1,1,1,1,1]]
c: [[1,1,1,1,1,1]]

Working on Java... parquet-cli cat doesn't like non-string map keys:

% pqcli cat ~/src/parquet-testing/data/no_path_in_schema.zstd.parquet 
Argument error: Map key type must be binary (UTF8): required int32 key

Contributor Author

etseidl commented May 12, 2026

I confirmed parquet-cli meta and pages work with the parquet-java PoC.

% parquet-cli pages no_path_in_schema.parquet

Column: a.key_value.key
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  Z _  6       5.00 B     30 B      
  0-1    data  Z R  6       4.33 B     26 B                        


Column: a.key_value.value.key_value.key
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  Z _  5       4.00 B     20 B      
  0-1    data  Z R  9       3.78 B     34 B                        


Column: a.key_value.value.key_value.value
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  Z _  9       3.33 B     30 B                        


Column: b
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  Z _  1       4.00 B     4 B       
  0-1    data  Z R  6       1.83 B     11 B                        


Column: c
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  Z _  1       8.00 B     8 B       
  0-1    data  Z R  6       1.83 B     11 B

% parquet-cli meta no_path_in_schema.zstd.parquet

File path:  no_path_in_schema.zstd.parquet
Created by: parquet-rs version 58.3.0
Properties:
                               ARROW:schema: /////wgCAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAMAAAB4AAAASAAAABQAAAAQABYAEAAAAA8ABAAAAAgAEAAAABgAAAAcAAAAAAAAAxgAAAAAAAYACAAGAAYAAAAAAAIAAAAAAAEAAABjAAAAxP7//xAAAAAYAAAAAAAAAhQAAAAU////IAAAAAAAAAEAAAAAAQAAAGIAAAC8////GAAAAAwAAAAAAAERSAEAAAEAAAAIAAAA5P7//xD///8cAAAADAAAAAAAAA0YAQAAAgAAAOgAAAAYAAAACP///xAAFAAQAA4ADwAEAAAACAAQAAAAGAAAAAwAAAAAAAERoAAAAAEAAAAIAAAAOP///2T///8cAAAADAAAAAAAAA1wAAAAAgAAADQAAAAIAAAAXP///4j///8UAAAADAAAAAAAAAYMAAAAAAAAAHj///8FAAAAdmFsdWUAAACw////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABrZXkACQAAAGtleV92YWx1ZQAAAAUAAAB2YWx1ZQAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAAAwAAAAAAAAFEAAAAAAAAAAEAAQABAAAAAMAAABrZXkACQAAAGtleV92YWx1ZQAAAAEAAABhAAAA
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"a","type":{"type":"map","keyType":"string","valueType":{"type":"map","keyType":"integer","valueType":"boolean","valueContainsNull":false},"valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"b","type":"integer","nullable":false,"metadata":{}},{"name":"c","type":"double","nullable":false,"metadata":{}}]}
Schema:
message arrow_schema {
  optional group a (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional group value (MAP) {
        repeated group key_value {
          required int32 key;
          required boolean value;
        }
      }
    }
  }
  required int32 b;
  required double c;
}


Row group 0:  count: 6  58.50 B records  start: 4  total(compressed): 351 B total(uncompressed):270 B 
--------------------------------------------------------------------------------
                                   type      encodings count     avg size   nulls   min / max
a.key_value.key                    BINARY    Z _ R     6         16.00 B            
a.key_value.value.key_value.key    INT32     Z _ R     9         10.44 B            
a.key_value.value.key_value.value  BOOLEAN   Z   _     9         5.22 B             
b                                  INT32     Z _ R     6         9.17 B             
c                                  DOUBLE    Z _ R     6         9.83 B
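
The column paths listed in the `parquet-cli meta` output above can be recovered from the schema alone, which is the sense in which `path_in_schema` is redundant. Below is a minimal sketch of that derivation, assuming the schema is available as the flattened, depth-first list of (name, num_children) pairs that the Parquet footer stores. The tuple representation and the `leaf_paths` helper are illustrative only and are not taken from any of the PoCs.

```python
from typing import List, Tuple

# Flattened schema matching the `parquet-cli meta` output above, as
# (name, num_children) pairs in depth-first order. The first entry is the root.
SCHEMA: List[Tuple[str, int]] = [
    ("arrow_schema", 3),
    ("a", 1),
    ("key_value", 2),
    ("key", 0),
    ("value", 1),
    ("key_value", 2),
    ("key", 0),
    ("value", 0),
    ("b", 0),
    ("c", 0),
]


def leaf_paths(schema: List[Tuple[str, int]]) -> List[str]:
    """Return the dotted path of every leaf column, in schema order."""
    paths: List[str] = []
    pos = 1  # index 0 is the root, which is not part of any column path

    def walk(prefix: List[str], num_children: int) -> None:
        nonlocal pos
        for _ in range(num_children):
            name, children = schema[pos]
            pos += 1
            path = prefix + [name]
            if children == 0:
                paths.append(".".join(path))  # leaf node: a physical column
            else:
                walk(path, children)  # group node: descend into its children

    walk([], schema[0][1])
    return paths


for path in leaf_paths(SCHEMA):
    print(path)
```

For this schema the sketch yields the same five paths shown in the row group listing. A reader that meets a column chunk without `path_in_schema` could, under the assumption that column chunks appear in leaf order, pair the i-th chunk with the i-th derived path.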

Contributor

alamb commented May 14, 2026

Nice
party-parrot

Development

Successfully merging this pull request may close these issues.

Make path_in_schema optional

4 participants